@ryanbreen
Owner

Summary

Fixes ARM64 boot stability by removing logging from critical paths, where a timer interrupt firing while the logger lock is held can cause a deadlock.

Key changes:

  • Remove logging from syscall handlers before preempt_disable() (timer interrupt + logger lock = deadlock)
  • Remove logging from FdTable::clone() which runs during fork()
  • Remove logging from ARM64 stack allocation path
  • Fix linker script: move __bss_start/__bss_end symbols outside the NOLOAD section (symbols defined inside NOLOAD sections were not visible to the linker)
  • Enable ARM64 linker script in build.rs

Root cause: Logging in critical sections where interrupts are enabled creates a window in which a timer interrupt can fire while the logger lock is held. Any context switch or interrupt handler that then tries to log deadlocks on that lock. The resulting corrupted state led to wild pointer accesses (e.g., the data abort at physical address 0x8000_0345, which doesn't exist in QEMU's RAM).
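The window described above can be modeled in userspace. This is a minimal sketch, not the kernel's logger: `SpinFlag` and `interrupt_can_log` are hypothetical names, and the "interrupt" is simulated by a plain function call on the same thread while the lock is still held. A real spin lock would retry forever at the point where this sketch probes once and gives up.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal non-reentrant spin lock standing in for the logger lock.
/// (Hypothetical type; the kernel's real logger lock is not shown in this PR.)
pub struct SpinFlag {
    held: AtomicBool,
}

impl SpinFlag {
    pub const fn new() -> Self {
        Self { held: AtomicBool::new(false) }
    }

    /// Attempt to take the lock exactly once; returns true on success.
    pub fn try_lock(&self) -> bool {
        self.held
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn unlock(&self) {
        self.held.store(false, Ordering::Release);
    }
}

/// Simulates a timer interrupt handler on the same CPU that wants to log.
/// Returns false when the lock is already held, which is the case where a
/// real spinning handler would never make progress.
pub fn interrupt_can_log(logger: &SpinFlag) -> bool {
    if logger.try_lock() {
        logger.unlock();
        true
    } else {
        false
    }
}

fn main() {
    let logger = SpinFlag::new();

    // The interrupted code path is in the middle of a log call.
    assert!(logger.try_lock());

    // The "interrupt" fires now; the holder cannot run until the handler returns.
    let stuck = !interrupt_can_log(&logger);
    println!("handler would spin forever: {stuck}");

    // Once the holder releases the lock, logging from the handler is fine.
    logger.unlock();
    println!("handler can log after release: {}", interrupt_can_log(&logger));
}
```

The sketch only shows why the lock can never be released in this situation; it does not model the memory corruption that followed in the kernel.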

Test plan

  • ./docker/qemu/run-aarch64-boot-test-strict.sh passes 60 of 60 consecutive boots, each reaching the "breenix>" prompt
  • Test accepts only genuine userspace shell prompt ("breenix>"), not kernel fallback mode
  • ARM64 kernel builds without linker errors

🤖 Generated with Claude Code

ryanbreen and others added 3 commits January 31, 2026 07:48
Fix deadlock conditions caused by logging in syscall handlers before
preempt_disable. Timer interrupts firing while holding the logger lock
caused context switches to hang when they tried to log.

Changes:
- syscall/handler.rs: Move preempt_disable() before any logging
- aarch64/syscall_entry.rs: Same fix for ARM64 syscall path
- ipc/fd.rs: Remove logging from FdTable::clone() which runs during fork()
- memory/heap.rs: Document why heap must stay in TTBR1 (TTBR0 switches)
- run-aarch64-boot-test-strict.sh: Only accept "breenix>" prompt (not
  kernel fallback mode) for honest test acceptance criteria

Root cause: The RING3_CONFIRMED/EL0_CONFIRMED markers were logged before
preempt_disable(), creating a window where timer interrupt + logger lock
= deadlock. FdTable::clone() logging during fork() had the same issue.
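
The ordering change can be illustrated with a small model. This is a sketch under assumptions: `Cpu`, its counters, and the two handler functions are hypothetical and only mirror the names `preempt_disable()` and `RING3_CONFIRMED` from this commit. It counts log calls made while preemption is still enabled, which is the window the fix closes.

```rust
use std::cell::Cell;

/// Hypothetical per-CPU state; the kernel's real layout is not part of this PR.
pub struct Cpu {
    preempt_count: Cell<u32>,
    /// Log calls that happened while preemption was still enabled.
    unsafe_logs: Cell<u32>,
}

impl Cpu {
    pub fn new() -> Self {
        Self { preempt_count: Cell::new(0), unsafe_logs: Cell::new(0) }
    }

    pub fn preempt_disable(&self) {
        self.preempt_count.set(self.preempt_count.get() + 1);
    }

    pub fn preempt_enable(&self) {
        self.preempt_count.set(self.preempt_count.get().saturating_sub(1));
    }

    /// Stand-in for a log call; records whether it ran inside the unsafe window.
    pub fn log(&self, _msg: &str) {
        if self.preempt_count.get() == 0 {
            self.unsafe_logs.set(self.unsafe_logs.get() + 1);
        }
    }

    pub fn unsafe_logs(&self) -> u32 {
        self.unsafe_logs.get()
    }
}

/// Old ordering: the marker is logged before preemption is disabled.
pub fn handler_before_fix(cpu: &Cpu) {
    cpu.log("RING3_CONFIRMED");
    cpu.preempt_disable();
    cpu.preempt_enable();
}

/// New ordering: preemption is disabled first, so the log call is covered.
pub fn handler_after_fix(cpu: &Cpu) {
    cpu.preempt_disable();
    cpu.log("RING3_CONFIRMED");
    cpu.preempt_enable();
}

fn main() {
    let old = Cpu::new();
    handler_before_fix(&old);
    let new = Cpu::new();
    handler_after_fix(&new);
    println!("unsafe logs before fix: {}", old.unsafe_logs());
    println!("unsafe logs after fix:  {}", new.unsafe_logs());
}
```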

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Achieves 100% boot success rate (20/20 consecutive boots).

Changes:
- stack.rs: Remove all logging from ARM64 stack allocation path
  (timer interrupt + logger lock = deadlock during process creation)
- linker.ld: Move __bss_start/__bss_end symbols outside NOLOAD section
  (symbols defined inside NOLOAD sections not visible to linker)
- build.rs: Enable ARM64 linker script via cargo:rustc-link-arg
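
For readers unfamiliar with the build.rs mechanism, the following is a rough sketch of how a build script can pass a linker script to the linker only on aarch64. The `cargo:rustc-link-arg` and `cargo:rerun-if-changed` directives and the `CARGO_CFG_TARGET_ARCH` variable are standard Cargo features; the `linker_args` helper and the `linker.ld` location relative to the manifest directory are assumptions for illustration, not the paths used in this repository.

```rust
use std::env;
use std::path::PathBuf;

/// Build the Cargo directives for the given target architecture.
/// Hypothetical helper; the script location below is assumed, not taken
/// from the repository.
pub fn linker_args(target_arch: &str, manifest_dir: &str) -> Vec<String> {
    let mut directives = Vec::new();
    if target_arch == "aarch64" {
        let script = PathBuf::from(manifest_dir).join("linker.ld");
        // Rebuild when the linker script changes.
        directives.push(format!("cargo:rerun-if-changed={}", script.display()));
        // Hand the script to the linker via -T.
        directives.push(format!("cargo:rustc-link-arg=-T{}", script.display()));
    }
    directives
}

fn main() {
    // Cargo sets both variables when it runs a build script.
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let dir = env::var("CARGO_MANIFEST_DIR").unwrap_or_else(|_| ".".to_string());
    for line in linker_args(&arch, &dir) {
        println!("{line}");
    }
}
```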

Root cause analysis:
1. Data abort at 0xFFFF_0000_8000_0345 was a synchronous external abort
   (not translation fault) - MMU translated successfully but physical
   memory at 0x8000_0345 doesn't exist in QEMU virt (only 512MB-1GB RAM)
2. The bogus address came from memory corruption caused by logger
   deadlock during stack allocation
3. Timer interrupt firing while logger lock held + stack allocation
   trying to log = corrupted state leading to wild pointer

The fix follows the same pattern as the syscall handler fixes: remove
ALL logging from critical paths where interrupts may be enabled.
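
The address arithmetic in point 1 can be checked with a short sketch. Two assumptions are made here: the kernel's direct map starts at 0xFFFF_0000_0000_0000 (inferred from the faulting address, not confirmed by this commit), and RAM on the QEMU virt machine starts at 0x4000_0000. The function names are hypothetical.

```rust
/// Start of RAM on the QEMU `virt` machine.
const RAM_BASE: u64 = 0x4000_0000;
/// Assumed base of the kernel's high-half direct map (inferred, not confirmed).
const DIRECT_MAP_BASE: u64 = 0xFFFF_0000_0000_0000;
const MIB: u64 = 1024 * 1024;

/// Translate a direct-map virtual address to a physical address.
/// Returns None for addresses below the assumed direct-map base.
pub fn virt_to_phys(va: u64) -> Option<u64> {
    va.checked_sub(DIRECT_MAP_BASE)
}

/// True if `pa` is backed by RAM of the given size starting at RAM_BASE.
pub fn in_ram(pa: u64, ram_size: u64) -> bool {
    pa >= RAM_BASE && pa - RAM_BASE < ram_size
}

fn main() {
    let pa = virt_to_phys(0xFFFF_0000_8000_0345).expect("address below direct map");
    // With 1 GiB of RAM the last valid byte is 0x7FFF_FFFF, so 0x8000_0345
    // lies just past the end for both sizes mentioned in the analysis.
    for size in [512 * MIB, 1024 * MIB] {
        println!("{pa:#x} backed by {} MiB of RAM: {}", size / MIB, in_ram(pa, size));
    }
}
```

Because the address is mapped but not backed by memory, the access fails as an external abort rather than a translation fault, which matches the analysis above.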

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Update run-aarch64-boot-test-native.sh to only accept "breenix>" prompt,
matching the strict test script. Previously accepted "Interactive Shell"
which is the kernel fallback mode when userspace fails - this was a
test-gaming pattern that could mask regressions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ryanbreen ryanbreen closed this pull request by merging all changes into main in 9700f18 Feb 1, 2026